# load libraries ----
# reporting
library(knitr)
# visualization
library(ggplot2)
library(ggthemes)
library(patchwork)
# data wrangling
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(janitor)
library(magrittr)
library(lubridate)
# modeling
library(corrplot)
library(GGally)
# set other options ----
# scientific notation
options(scipen=999)
# turn off messages and warnings
knitr::opts_chunk$set(
tidy = FALSE,
message = FALSE,
warning = FALSE)
# set the ggplot theme for the document
theme_set(theme_bw())
25 Predictor variables
Learning Objectives
After completing this tutorial you should be able to:
- distinguish between outcome and predictor variables (features).
- outline characteristics required to compile appropriate data sets to train a predictive model.
- evaluate the correlation among predictor variables using corrplot.
Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the Rproj file for this module either by double clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the R project icon in the top right of your program window and selecting Open Project.
There should be a file named 25_predictors.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing, in the YAML header change the author to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set1, and you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.
1 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual. Especially when the code is already included for you, add comments describing how each function works/what it does as we introduce it during the participatory coding session, so you can refer back to it.
You may need to install a few additional packages before we get started. Check the code chunk below and install any packages you still need.
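As a sketch, you can check which of the packages loaded above are still missing on your machine (the vector below mirrors the `library()` calls at the top of this document):

```r
# packages loaded at the top of this document
pkgs <- c("knitr", "ggplot2", "ggthemes", "patchwork", "dplyr", "tidyr",
"readr", "skimr", "janitor", "magrittr", "lubridate",
"corrplot", "GGally")
# compare against what is already installed
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs  # anything listed here still needs to be installed
```

If `missing_pkgs` is not empty, pass it to `install.packages(missing_pkgs)` before continuing.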
25.1 Machine learning and predictive modeling
We are going to use machine learning to build and train a predictive model.
You might recall from your biostats class that typically we draw a sample from a population and then use what we learn from the sample to make inferences about the population it represents, i.e. we can infer the average value of an observation in the (unsampled) population based on the sample2.
2 Recall, for linear regressions we use the formula “on average, for an increase of one unit of the independent variable we expect the dependent variable to increase by the value of the slope” to describe our results.
For prediction we also start with a sample; in this case, however, the goal is to predict a single observation’s value of a given characteristic based on the characteristics of the observations in the sample. Our goal is not to make a statement about the average expected value of an observation - we want to predict the actual value for that observation.
In the case at hand, we have our continuous outcome variable that we want to predict (air pollution levels) which we will denote as \(Y\). \(X\) would be our set of predictor variables with \(X_{1}, ... X_{p}\) denoting individual features (development, population density …).
Our goal is to describe a rule (machine learning algorithm) that takes values of our features as input and then predicts the air pollution level (outcome variable) for situations where air pollution is unknown (\(\hat{Y}\)). This process is what we generally describe as training a model and means that we are using a data set where both \(X\) and \(Y\) are known to estimate a function \(\hat{Y} = f(X)\) that uses the predictor variables as input.
The better our model, the more closely our predicted outcome \(\hat{Y}\) should match our actual outcome \(Y\). In other words we are trying to minimize the distance between \(\hat{Y} = f(X)\) and \(Y\)3. The choice of the distance metric (\(d(Y - f(X))\)) is frequently the absolute or squared difference.
3 This is what is referred to as an optimization problem.
In order to set up (and then solve!) a typical ML problem we need four components:
- a training data set (i.e. a matched data set of outcome and predictor variables),
- a (set of) algorithm(s) to estimate candidate functions \(f\),
- a distance metric \(d\) to measure how closely \(Y\) and \(\hat{Y}\) match, and
- a definition of a “good” (acceptable) distance.
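As a minimal sketch of these four components on simulated data (all variable names below are made up for illustration): the training set is a toy data frame, the algorithm is a simple linear regression, the distance metric is the root mean squared error, and “good” is a threshold we choose ourselves.

```r
set.seed(42)
# 1. a training data set: predictor x and outcome y with a known relationship
train <- data.frame(x = runif(100, 0, 10))
train$y <- 2 * train$x + rnorm(100)
# 2. an algorithm to estimate f: here, ordinary least squares
f_hat <- lm(y ~ x, data = train)
# 3. a distance metric d: root mean squared error between Y and Y-hat
rmse <- sqrt(mean((train$y - predict(f_hat))^2))
# 4. a definition of "good": e.g. an RMSE below 2 units
rmse < 2
```

With real data the hard part is choosing the algorithm and the acceptable distance; the structure of the problem stays the same.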
25.2 Data for predictive modeling
We will use a data set gathered using air pollution monitoring. In this system of monitors, about 90% are located within cities, leaving rural areas severely under-monitored. Our goal is to use this data set to train a model that can accurately predict air pollution levels even where physical monitors are not present. Interest in air pollution stems primarily from the adverse health outcomes related to exposure.
Before we start, we should consider limitations of the data set we will look at to make sure that we can answer our question and to make sure that we don’t overstep in our interpretation.
We will be using supervised machine learning to predict pollution levels. To do this, we need two main types of data:
- A continuous outcome variable - this is the variable we want to predict.
- A set of features or predictor variables used to predict the outcome variable.
In order to build and train a model you need corresponding data sets4. The underlying principle is that if you determine the existing relationship and describe it mathematically using an existing data set, then you can also predict the value of the outcome variable for a new observation, as long as you have values for its predictor variables.
4 Corresponding means they should have the same/very similar spatial and temporal resolution.
With the rise of computational power available at our fingertips, machine learning and artificial intelligence approaches are increasingly being used to solve problems, especially when large data sets are involved. Unfortunately, this means the process quickly turns into a black box where we dump in some outcome and predictor variables, give it a good shake, and take the “answer” we receive at face value.
To avoid this we need to start with a specific question in mind and carefully consider how our outcome and features are related. Good questions have a plausible explanation for why the features predict the outcome and critically evaluate the potential for variation in both the features and the outcome over time.
We will be using a data set that was previously compiled by Roger Peng, a scientist who studies air pollution, climate change and public health. We will import a data set to work with that has already combined monitoring data and possible predictor variables that have been compiled by a range of sources.
The monitoring data come from gravimetric monitors operated by the EPA that are designed to capture fine particulate matter (PM2.5) using a filtration system. Values are measured either daily or weekly. Our feature data set contains values for each of the 876 monitors (observations) and has been compiled from the EPA, NASA, the US Census, and the National Center for Health Statistics. Most of the features were computed for a circular area surrounding the monitor itself (a buffer).
For this module we will explore geographic patterns of air pollution and determine whether we can predict levels of pollution for areas where we do not have monitors based on available information from areas that do have monitors. Specifically, we will answer the central question: Can we predict annual average air pollution concentrations at the resolution of zip code regions using predictor variables describing population density, urbanization, road density, satellite pollution data, and chemical modeling data?
25.3 Explore the data set
Let’s import the data set.
pm <- read_delim("data/air_pollution.csv", delim = ",") %>%
clean_names()
Let’s take a quick look at the data types to make sure everything read in correctly.
pm %>%
glimpse()
Rows: 876
Columns: 50
$ id <dbl> 1003.001, 1027.000, 1033.100, 1049.100, 10…
$ value <dbl> 9.597647, 10.800000, 11.212174, 11.659091,…
$ fips <dbl> 1003, 1027, 1033, 1049, 1055, 1069, 1073, …
$ lat <dbl> 30.49800, 33.28126, 34.75878, 34.28763, 33…
$ lon <dbl> -87.88141, -85.80218, -87.65056, -85.96830…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama"…
$ county <chr> "Baldwin", "Clay", "Colbert", "DeKalb", "E…
$ city <chr> "Fairhope", "Ashland", "Muscle Shoals", "C…
$ cmaq <dbl> 8.098836, 9.766208, 9.402679, 8.534772, 9.…
$ zcta <dbl> 36532, 36251, 35660, 35962, 35901, 36303, …
$ zcta_area <dbl> 190980522, 374132430, 16716984, 203836235,…
$ zcta_pop <dbl> 27829, 5103, 9042, 8300, 20045, 30217, 901…
$ imp_a500 <dbl> 0.01730104, 1.96972318, 19.17301038, 5.782…
$ imp_a1000 <dbl> 1.4096021, 0.8531574, 11.1448962, 3.867647…
$ imp_a5000 <dbl> 3.3360118, 0.9851479, 15.1786154, 1.231141…
$ imp_a10000 <dbl> 1.9879187, 0.5208189, 9.7253870, 1.0316469…
$ imp_a15000 <dbl> 1.4386207, 0.3359198, 5.2472094, 0.9730444…
$ county_area <dbl> 4117521611, 1564252280, 1534877333, 201266…
$ county_pop <dbl> 182265, 13932, 54428, 71109, 104430, 10154…
$ log_dist_to_prisec <dbl> 4.648181, 7.219907, 5.760131, 3.721489, 5.…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 8.517193, 8.517193, 9.…
$ log_pri_length_10000 <dbl> 9.210340, 9.210340, 9.274303, 10.409411, 1…
$ log_pri_length_15000 <dbl> 9.630228, 9.615805, 9.658899, 11.173626, 1…
$ log_pri_length_25000 <dbl> 11.32735, 10.12663, 10.15769, 11.90959, 12…
$ log_prisec_length_500 <dbl> 7.295356, 6.214608, 8.611945, 7.310155, 8.…
$ log_prisec_length_1000 <dbl> 8.195119, 7.600902, 9.735569, 8.585843, 9.…
$ log_prisec_length_5000 <dbl> 10.815042, 10.170878, 11.770407, 10.214200…
$ log_prisec_length_10000 <dbl> 11.88680, 11.40554, 12.84066, 11.50894, 12…
$ log_prisec_length_15000 <dbl> 12.205723, 12.042963, 13.282656, 12.353663…
$ log_prisec_length_25000 <dbl> 13.41395, 12.79980, 13.79973, 13.55979, 13…
$ log_nei_2008_pm25_sum_10000 <dbl> 0.318035438, 3.218632928, 6.573127301, 0.0…
$ log_nei_2008_pm25_sum_15000 <dbl> 1.967358961, 3.218632928, 6.581917457, 3.2…
$ log_nei_2008_pm25_sum_25000 <dbl> 5.067308, 3.218633, 6.875900, 4.887665, 4.…
$ log_nei_2008_pm10_sum_10000 <dbl> 1.35588511, 3.31111648, 6.69187313, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 2.26783411, 3.31111648, 6.70127741, 3.3500…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.628728, 3.311116, 7.148858, 5.171920, 4.…
$ popdens_county <dbl> 44.265706, 8.906492, 35.460814, 35.330814,…
$ popdens_zcta <dbl> 145.716431, 13.639555, 540.887040, 40.7189…
$ nohs <dbl> 3.3, 11.6, 7.3, 14.3, 4.3, 5.8, 7.1, 2.7, …
$ somehs <dbl> 4.9, 19.1, 15.8, 16.7, 13.3, 11.6, 17.1, 6…
$ hs <dbl> 25.1, 33.9, 30.6, 35.0, 27.8, 29.8, 37.2, …
$ somecollege <dbl> 19.7, 18.8, 20.9, 14.9, 29.2, 21.4, 23.5, …
$ associate <dbl> 8.2, 8.0, 7.6, 5.5, 10.1, 7.9, 7.3, 8.0, 4…
$ bachelor <dbl> 25.3, 5.5, 12.7, 7.9, 10.0, 13.7, 5.9, 17.…
$ grad <dbl> 13.5, 3.1, 5.1, 5.8, 5.4, 9.8, 2.0, 8.7, 2…
$ pov <dbl> 6.1, 19.5, 19.0, 13.8, 8.8, 15.6, 25.5, 7.…
$ hs_orless <dbl> 33.3, 64.6, 53.7, 66.0, 45.4, 47.2, 61.4, …
$ urc2013 <dbl> 4, 6, 4, 6, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ urc2006 <dbl> 5, 6, 4, 5, 4, 4, 1, 1, 1, 1, 1, 1, 1, 2, …
$ aod <dbl> 37.36364, 34.81818, 36.00000, 33.08333, 43…
These are all currently numerical variables and R thinks they are continuous variables. However, id is the monitor ID, fips is the Federal Information Processing Standards code of the county where the monitor is located, and zcta is the zip code tabulation area.
These should all be categorical variables, so we need to convert them to factors (R’s data type for categorical data) using the function as.factor().
pm <- pm %>%
mutate(across(c(id, fips, zcta), as.factor))
Let’s dig a little bit deeper using skim().
skim(pm)
| Name | pm |
| Number of rows | 876 |
| Number of columns | 50 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| factor | 3 |
| numeric | 44 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 49 | 0 |
| county | 0 | 1 | 3 | 20 | 0 | 471 | 0 |
| city | 0 | 1 | 4 | 48 | 0 | 607 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| id | 0 | 1 | FALSE | 876 | 100: 1, 102: 1, 103: 1, 104: 1 |
| fips | 0 | 1 | FALSE | 569 | 170: 12, 603: 10, 261: 9, 107: 8 |
| zcta | 0 | 1 | FALSE | 842 | 475: 3, 110: 2, 160: 2, 290: 2 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| value | 0 | 1 | 10.81 | 2.58 | 3.02 | 9.27 | 11.15 | 12.37 | 23.16 | ▂▆▇▁▁ |
| lat | 0 | 1 | 38.48 | 4.62 | 25.47 | 35.03 | 39.30 | 41.66 | 48.40 | ▁▃▅▇▂ |
| lon | 0 | 1 | -91.74 | 14.96 | -124.18 | -99.16 | -87.47 | -80.69 | -68.04 | ▃▂▃▇▃ |
| cmaq | 0 | 1 | 8.41 | 2.97 | 1.63 | 6.53 | 8.62 | 10.24 | 23.13 | ▃▇▃▁▁ |
| zcta_area | 0 | 1 | 183173481.91 | 542598878.48 | 15459.00 | 14204601.75 | 37653560.50 | 160041508.25 | 8164820625.00 | ▇▁▁▁▁ |
| zcta_pop | 0 | 1 | 24227.58 | 17772.16 | 0.00 | 9797.00 | 22014.00 | 35004.75 | 95397.00 | ▇▇▃▁▁ |
| imp_a500 | 0 | 1 | 24.72 | 19.34 | 0.00 | 3.70 | 25.12 | 40.22 | 69.61 | ▇▅▆▃▂ |
| imp_a1000 | 0 | 1 | 24.26 | 18.02 | 0.00 | 5.32 | 24.53 | 38.59 | 67.50 | ▇▅▆▃▁ |
| imp_a5000 | 0 | 1 | 19.93 | 14.72 | 0.05 | 6.79 | 19.07 | 30.11 | 74.60 | ▇▆▃▁▁ |
| imp_a10000 | 0 | 1 | 15.82 | 13.81 | 0.09 | 4.54 | 12.36 | 24.17 | 72.09 | ▇▃▂▁▁ |
| imp_a15000 | 0 | 1 | 13.43 | 13.12 | 0.11 | 3.24 | 9.67 | 20.55 | 71.10 | ▇▃▁▁▁ |
| county_area | 0 | 1 | 3768701992.12 | 6212829553.56 | 33703512.00 | 1116536297.50 | 1690826566.50 | 2878192209.00 | 51947229509.00 | ▇▁▁▁▁ |
| county_pop | 0 | 1 | 687298.44 | 1293488.74 | 783.00 | 100948.00 | 280730.50 | 743159.00 | 9818605.00 | ▇▁▁▁▁ |
| log_dist_to_prisec | 0 | 1 | 6.19 | 1.41 | -1.46 | 5.43 | 6.36 | 7.15 | 10.45 | ▁▁▃▇▁ |
| log_pri_length_5000 | 0 | 1 | 9.82 | 1.08 | 8.52 | 8.52 | 10.05 | 10.73 | 12.05 | ▇▂▆▅▂ |
| log_pri_length_10000 | 0 | 1 | 10.92 | 1.13 | 9.21 | 9.80 | 11.17 | 11.83 | 13.02 | ▇▂▇▇▃ |
| log_pri_length_15000 | 0 | 1 | 11.50 | 1.15 | 9.62 | 10.87 | 11.72 | 12.40 | 13.59 | ▆▂▇▇▃ |
| log_pri_length_25000 | 0 | 1 | 12.24 | 1.10 | 10.13 | 11.69 | 12.46 | 13.05 | 14.36 | ▅▃▇▇▃ |
| log_prisec_length_500 | 0 | 1 | 6.99 | 0.95 | 6.21 | 6.21 | 6.21 | 7.82 | 9.40 | ▇▁▂▂▁ |
| log_prisec_length_1000 | 0 | 1 | 8.56 | 0.79 | 7.60 | 7.60 | 8.66 | 9.20 | 10.47 | ▇▅▆▃▁ |
| log_prisec_length_5000 | 0 | 1 | 11.28 | 0.78 | 8.52 | 10.91 | 11.42 | 11.83 | 12.78 | ▁▁▃▇▃ |
| log_prisec_length_10000 | 0 | 1 | 12.41 | 0.73 | 9.21 | 11.99 | 12.53 | 12.94 | 13.85 | ▁▁▃▇▅ |
| log_prisec_length_15000 | 0 | 1 | 13.03 | 0.72 | 9.62 | 12.59 | 13.13 | 13.57 | 14.41 | ▁▁▃▇▅ |
| log_prisec_length_25000 | 0 | 1 | 13.82 | 0.70 | 10.13 | 13.38 | 13.92 | 14.35 | 15.23 | ▁▁▃▇▆ |
| log_nei_2008_pm25_sum_10000 | 0 | 1 | 3.97 | 2.35 | 0.00 | 2.15 | 4.29 | 5.69 | 9.12 | ▆▅▇▆▂ |
| log_nei_2008_pm25_sum_15000 | 0 | 1 | 4.72 | 2.25 | 0.00 | 3.47 | 5.00 | 6.35 | 9.42 | ▃▃▇▇▂ |
| log_nei_2008_pm25_sum_25000 | 0 | 1 | 5.67 | 2.11 | 0.00 | 4.66 | 5.91 | 7.28 | 9.65 | ▂▂▇▇▃ |
| log_nei_2008_pm10_sum_10000 | 0 | 1 | 4.35 | 2.32 | 0.00 | 2.69 | 4.62 | 6.07 | 9.34 | ▅▅▇▇▂ |
| log_nei_2008_pm10_sum_15000 | 0 | 1 | 5.10 | 2.18 | 0.00 | 3.87 | 5.39 | 6.72 | 9.71 | ▂▃▇▇▂ |
| log_nei_2008_pm10_sum_25000 | 0 | 1 | 6.07 | 2.01 | 0.00 | 5.10 | 6.37 | 7.52 | 9.88 | ▁▂▆▇▃ |
| popdens_county | 0 | 1 | 551.76 | 1711.51 | 0.26 | 40.77 | 156.67 | 510.81 | 26821.91 | ▇▁▁▁▁ |
| popdens_zcta | 0 | 1 | 1279.66 | 2757.49 | 0.00 | 101.15 | 610.35 | 1382.52 | 30418.84 | ▇▁▁▁▁ |
| nohs | 0 | 1 | 6.99 | 7.21 | 0.00 | 2.70 | 5.10 | 8.80 | 100.00 | ▇▁▁▁▁ |
| somehs | 0 | 1 | 10.17 | 6.20 | 0.00 | 5.90 | 9.40 | 13.90 | 72.20 | ▇▂▁▁▁ |
| hs | 0 | 1 | 30.32 | 11.40 | 0.00 | 23.80 | 30.75 | 36.10 | 100.00 | ▂▇▂▁▁ |
| somecollege | 0 | 1 | 21.58 | 8.60 | 0.00 | 17.50 | 21.30 | 24.70 | 100.00 | ▆▇▁▁▁ |
| associate | 0 | 1 | 7.13 | 4.01 | 0.00 | 4.90 | 7.10 | 8.80 | 71.40 | ▇▁▁▁▁ |
| bachelor | 0 | 1 | 14.90 | 9.71 | 0.00 | 8.80 | 12.95 | 19.22 | 100.00 | ▇▂▁▁▁ |
| grad | 0 | 1 | 8.91 | 8.65 | 0.00 | 3.90 | 6.70 | 11.00 | 100.00 | ▇▁▁▁▁ |
| pov | 0 | 1 | 14.95 | 11.33 | 0.00 | 6.50 | 12.10 | 21.22 | 65.90 | ▇▅▂▁▁ |
| hs_orless | 0 | 1 | 47.48 | 16.75 | 0.00 | 37.92 | 48.65 | 59.10 | 100.00 | ▁▃▇▃▁ |
| urc2013 | 0 | 1 | 2.92 | 1.52 | 1.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▅▃▂▁ |
| urc2006 | 0 | 1 | 2.97 | 1.52 | 1.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▅▃▂▁ |
| aod | 0 | 1 | 43.70 | 19.56 | 5.00 | 31.66 | 40.17 | 49.67 | 143.00 | ▃▇▁▁▁ |
Data summary of the feature data set for air pollution monitoring.
The next step is to look a little deeper at what specific information is in each column, to make sure we are considering confounding variables or biases in the data set.
What states are included?
| state |
|---|
| Alabama |
| Arizona |
| Arkansas |
| California |
| Colorado |
| Connecticut |
| Delaware |
| District Of Columbia |
| Florida |
| Georgia |
| Idaho |
| Illinois |
| Indiana |
| Iowa |
| Kansas |
| Kentucky |
| Louisiana |
| Maine |
| Maryland |
| Massachusetts |
| Michigan |
| Minnesota |
| Mississippi |
| Missouri |
| Montana |
| Nebraska |
| Nevada |
| New Hampshire |
| New Jersey |
| New Mexico |
| New York |
| North Carolina |
| North Dakota |
| Ohio |
| Oklahoma |
| Oregon |
| Pennsylvania |
| Rhode Island |
| South Carolina |
| South Dakota |
| Tennessee |
| Texas |
| Utah |
| Vermont |
| Virginia |
| Washington |
| West Virginia |
| Wisconsin |
| Wyoming |
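A list of unique values like the one above can be generated with `dplyr::distinct()`; here is a sketch on a toy data frame (the rows are made up for illustration):

```r
library(dplyr)
# toy monitor table with repeated states
monitors <- tibble(state = c("Alabama", "Alabama", "Arizona", "Arkansas"))
# keep one row per unique state
monitors %>%
  distinct(state)
```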
Cities with largest number of air monitors?
| city | n |
|---|---|
| Not in a city | 103 |
| New York | 9 |
| Cleveland | 6 |
| Baltimore | 5 |
| Chicago | 5 |
| Detroit | 5 |
| Milwaukee | 5 |
| New Haven | 5 |
| Philadelphia | 5 |
| Springfield | 5 |
| Boston | 4 |
| Columbus | 4 |
| Indianapolis (Remainder) | 4 |
| Louisville | 4 |
| St. Louis | 4 |
| Washington | 4 |
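A count table like the one above can be produced with `dplyr::count()`; here is a sketch on toy data (the city values are invented for illustration):

```r
library(dplyr)
# toy monitor locations
monitors <- tibble(city = c("Not in a city", "Not in a city", "Not in a city",
                            "New York", "New York", "Cleveland"))
# tally monitors per city, most monitors first
monitors %>%
  count(city, sort = TRUE)
```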
We should also evaluate to what extent our predictor variables are correlated.
When we fit linear regressions with correlated predictor variables, the predictors end up explaining each other rather than the outcome variable. This is generally referred to as multicollinearity.
Additionally, including redundant variables can add unnecessary noise to our model, which can decrease the accuracy of our predictions and slow down model fitting.
Correlation among predictor variables can also make it difficult to understand which of them are actually predictive.
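To make “correlated predictors” concrete, here is a small base R simulation (a sketch, not drawn from the air pollution data): two predictors built from the same underlying signal end up nearly redundant.

```r
set.seed(1)
signal <- rnorm(200)                 # shared underlying signal
x1 <- signal + rnorm(200, sd = 0.2)  # predictor 1: signal plus noise
x2 <- signal + rnorm(200, sd = 0.2)  # predictor 2: signal plus noise
cor(x1, x2)                          # close to 1: x1 and x2 are redundant
```

In a model, x1 and x2 would carry almost the same information, which is exactly the situation the correlation plots below help us detect.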
We’ll start by making pairwise comparisons among all of our numeric (continuous) variables using corrplot.
# calculate Pearson's coefficients
PM_cor <- pm %>%
select_if(is.numeric) %>%
cor()
# plot pairwise correlation matrix
corrplot(PM_cor, tl.cex = 0.5)
Let’s re-plot that visualization using the absolute values of the Pearson correlation coefficient5 and by organizing our variables using hierarchical clustering, i.e. variables more similar to each other will be closer to each other.
5 We are not interested in the direction of the correlation, only in how strong it is.
corrplot(abs(PM_cor),
order = "hclust",
tl.cex = 0.5,
col.lim = c(0, 1))
We can now see that there are quite a few variables correlated with our outcome variable (value). Unsurprisingly, we also have sets of correlated variables describing the characteristics of the area surrounding each monitor, primarily the variables containing imp_*, pri_*, and nei_*.
We can take a closer look at sets of variables we are interested in using the ggcorr() and ggpairs() functions from the GGally package.
Let’s start with our variables measuring how impervious the surface is:
# plot pairwise
pm %>%
select(contains("imp")) %>%
ggcorr(label = TRUE)
Let’s take an even closer look at the relationships among these variables, visualizing the relationships using scatter plots and the distribution of values for each variable using density plots, along with the correlation coefficients.
# plot scatter plots, density plots & correlation coefficients
pm %>%
select(contains("imp")) %>%
ggpairs()
Next, let’s take a peek at our road density data:
pm %>%
select(contains("pri")) %>%
ggpairs()
Finally, let’s look at the emissions variables:
pm %>%
select(contains("nei")) %>%
ggpairs()
Since we now know that the variables within these categories are correlated with each other, let’s select one variable from each category at the same buffer size and compare how they are correlated to each other as well as to population density.
pm %>%
select(log_nei_2008_pm25_sum_10000, popdens_county,
log_pri_length_10000, imp_a10000, county_pop) %>%
ggpairs()
Some of our variables have extreme variability; one way to handle this is a log transformation, which might affect our correlations, so we should take a quick look.
pm %>%
mutate(log_popdens_county = log(popdens_county)) %>%
mutate(log_pop_county = log(county_pop)) %>%
select(log_nei_2008_pm25_sum_10000, log_popdens_county,
log_pri_length_10000, imp_a10000, log_pop_county) %>%
ggpairs()
This does increase our correlation levels a bit, but overall these variables do not appear to be highly correlated, so we should make sure to keep a variable from each category to use as predictors for our model.
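As a quick illustration of how a log transformation can change a Pearson correlation (simulated data in base R, not the air pollution data): a heavily right-skewed variable can weaken a linear correlation that is stronger on the log scale.

```r
set.seed(7)
log_dens <- rnorm(300, mean = 8, sd = 1)    # e.g. a log population density
dens     <- exp(log_dens)                   # heavily right-skewed
y        <- log_dens + rnorm(300, sd = 0.5) # outcome tied to the log scale
cor(dens, y)       # weaker: skew distorts the linear correlation
cor(log(dens), y)  # stronger after the log transformation
```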
Finally, let’s add an additional categorical variable that tells us whether or not a monitor is in a city - this will be helpful down the line.
pm <- pm %>%
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
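On a toy vector, the `case_when()` recoding above behaves like this (a sketch; note that every value other than "Not in a city" collapses into "In a city"):

```r
library(dplyr)
# toy city values; any actual city name becomes "In a city"
city <- c("Not in a city", "Boston", "Cleveland")
case_when(city == "Not in a city" ~ "Not in a city",
          city != "Not in a city" ~ "In a city")
# → "Not in a city" "In a city" "In a city"
```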